SiRen: Leveraging Similar Regions for Efficient & Accurate Variant Calling

Authors

Kristal Curtis

Ameet Talwalkar

Matei Zaharia

Armando Fox

David A. Patterson

Abstract

Next-generation genomic sequencing costs are rapidly decreasing, having recently reached the $1000-per-genome barrier, a likely tipping point for widespread clinical use. However, genomic analysis techniques have failed to keep pace. In particular, the process of variant calling, or inferring a sample genome from the noisy sequencing data, introduces major computational and statistical challenges. In this work, we explore the feasibility of a hybrid approach that addresses these challenges by partitioning the genome into easier and harder regions, deploying efficient algorithms on the easier regions, and relying on more expensive and accurate technologies in the harder regions. We propose that near duplication, or similarity, in the genome is a natural signal for identifying harder regions, and then present a large-scale distributed clustering approach to identify these similar regions. We perform an extensive empirical study illustrating the effectiveness of existing variant calling algorithms on the easier regions and their contrasting struggles on the similar regions. We also confirm that the similar regions are sufficiently disjoint, thus providing the opportunity for sophisticated analysis of these regions in an embarrassingly parallel manner.
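The partitioning idea in the abstract can be illustrated with a toy sketch. This is not the authors' method: the abstract describes a large-scale distributed clustering approach, whereas the code below assumes fixed-size non-overlapping windows, Hamming distance as the similarity measure, and a quadratic all-pairs comparison joined with union-find. The names `hamming`, `similar_region_clusters`, and `partition_regions`, and the parameters `window` and `max_mismatches`, are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def similar_region_clusters(genome: str, window: int, max_mismatches: int):
    """Cluster non-overlapping windows of `genome` that are near duplicates.

    Two windows are linked when they differ in at most `max_mismatches`
    positions; clusters are the connected components of those links.
    Returns only clusters with two or more windows, as lists of start offsets.
    (Illustrative assumption: fixed windows and Hamming distance.)
    """
    starts = list(range(0, len(genome) - window + 1, window))
    parent = {s: s for s in starts}

    def find(s: int) -> int:
        # Union-find lookup with path halving.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    # Quadratic all-pairs comparison: fine for a toy input, not for a genome.
    for a, b in combinations(starts, 2):
        if hamming(genome[a:a + window], genome[b:b + window]) <= max_mismatches:
            parent[find(a)] = find(b)

    groups = {}
    for s in starts:
        groups.setdefault(find(s), []).append(s)
    return sorted(g for g in groups.values() if len(g) > 1)


def partition_regions(genome: str, window: int, max_mismatches: int):
    """Split window starts into 'easy' (unique) and 'hard' (similar) sets."""
    clusters = similar_region_clusters(genome, window, max_mismatches)
    hard = sorted(s for cluster in clusters for s in cluster)
    hard_set = set(hard)
    starts = range(0, len(genome) - window + 1, window)
    easy = [s for s in starts if s not in hard_set]
    return easy, hard, clusters


if __name__ == "__main__":
    toy = "ACGTACGA" + "TTTTGGGG" + "ACGTACGT" + "CCCCAAAA"
    easy, hard, clusters = partition_regions(toy, window=8, max_mismatches=1)
    print("easy windows:", easy)
    print("hard windows:", hard)
    print("clusters:", clusters)
```

In this sketch the clusters are disjoint by construction, so each one could be handed to a separate worker for more expensive analysis, while the easy windows go to a fast caller. A real pipeline would need overlapping windows, an index rather than all-pairs comparison, and a similarity measure that tolerates insertions and deletions.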

Similar resources

GW-CALL: Accurate Genome-Wide Variant Caller

The main challenge in reliable variant calling using DNA reads is to extract information from reads mappable to multiple locations on the reference genome. Conventional approaches ignore these reads and rely on reads mappable uniquely to the reference genome. These approaches fail to perform satisfactorily in variant calling within repeat regions which are abundant in many species including hom...

A statistical variant calling approach from pedigree information and local haplotyping with phase informative reads

MOTIVATION Variant calling from genome-wide sequencing data is essential for the analysis of disease-causing mutations and elucidation of disease mechanisms. However, variant calling in low coverage regions is difficult due to sequence read errors and mapping errors. Hence, variant calling approaches that are robust to low coverage data are demanded. RESULTS We propose a new variant calling a...

Variant Callers for Next-Generation Sequencing Data: A Comparison Study

Next generation sequencing (NGS) has been leading the genetic study of human disease into an era of unprecedented productivity. Many bioinformatics pipelines have been developed to call variants from NGS data. The performance of these pipelines depends crucially on the variant caller used and on the calling strategies implemented. We studied the performance of four prevailing callers, SAMtools,...

Leveraging Identity-by-Descent for Accurate Genotype Inference in Family Sequencing Data

Sequencing family DNA samples provides an attractive alternative to population based designs to identify rare variants associated with human disease due to the enrichment of causal variants in pedigrees. Previous studies showed that genotype calling accuracy can be improved by modeling family relatedness compared to standard calling algorithms. Current family-based variant calling methods use s...

Gappy Total Recaller: Efficient Algorithms and Data Structures for Accurate Transcriptomics

Understanding complex mammalian biology depends crucially on our ability to define a precise map of all the transcripts encoded in a genome, and to measure their relative abundances. A promising assay depends on RNASeq approaches, which builds on next generation sequencing pipelines capable of interrogating cDNAs extracted from a cell. The underlying pipeline starts with base-calling, collect t...

Publication date: 2015
